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Abstract. Questions in computational molecular biology generate various 
discrete optimization problems, such as DNA sequence alignment and RNA 
secondary structure prediction. However, the optimal solutions are fundamen¬ 
tally dependent on the parameters used in the objective functions. The goal 
of a parametric analysis is to elucidate such dependencies, especially as they 
pertain to the accuracy and robustness of the optimal solutions. Techniques 
from geometric combinatorics, including polytopes and their normal fans, have 
been used previously to give parametric analyses of simple models for DNA se¬ 
quence alignment and RNA branching configurations. Here, we present a new 
computational framework, and proof-of-principle results, which give the first 
complete parametric analysis of the branching portion of the nearest neigh¬ 
bor thermodynamic model for secondary structure prediction for real RNA 
sequences. 


1. Motivation 

Over the past four decades, improvements in biotechnology have greatly acceler¬ 
ated the amount of biological sequence data available. Yet, a fundamental challenge 
in computational molecular biology remains to reliably infer functional information 
from the linear encoding of DNA, RNA, and protein molecules. 

As articulated in the central dogma of molecular biology, genetic information is 
stored in DNA from which it is transcribed into messenger RNA, and then trans¬ 
lated into proteins by ribosomal and transfer RNA. However, as always in biology, 
a wealth of complexity lurks below the surface of this basic principle. Historically, 
most interest was focused on DNA sequences (as the cellular genome) and protein 
structures (as the cellular machinery). Since the early 2000’s, though, attention has 
increasingly turned to RNA as many more critical functions have been revealed, in¬ 
cluding gene splicing, editing, and regulation. 

Like DNA, RNA is a sequence of nucleic acids, abbreviated A, C, G, and U (instead 
of T), which form the familiar Watson-Crick pairings. Unlike the canonical double- 
stranded DNA helix, most RNA molecules are naturally single-stranded and the 
intra-sequence base pairings are an integral component of the three-dimensional 
structure. This is in contrast to the more subtle amino acid interactions which 
govern the formation of protein structures. However, knowing the structure of a 
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protein or RNA molecule is critical to understanding and manipulating its cellular 
functions. 

The structure of an RNA molecule is understood hierarchically, from the lin¬ 
ear biochemical chain through the planar set of canonical base pairings to the 3D 
molecule, whose structure includes more complicated tertiary interactions like pseu¬ 
doknots and base triples. Given the prevalence of RNA sequence data and the dif¬ 
ficulties, both experimental and computational, of determining 3D structures, the 
majority of attention has focused on the set of base pairs, known as the secondary 
structure. In particular, one seeks to answer: What are the native base pairs for a 
given RNA sequence? As explained below, it is possible to efficiently compute an 
optimal secondary structure under the current Nearest Neighbor Thermodynamic 
Model (NNTM). However, when these minimum free energy (MFE) predictions 
are compared to structures derived from information-theoretic means, the current 
gold-standard, the average accuracy for longer ribosomal RNA sequences is only 
40%. Hence, it is critical to understand which aspects of RNA base pairing are not 
captured well by the NNTM. 

One known weakness of the current model is the energy function which governs 
the branching of an RNA secondary structure. For computational reasons, the 
entropic cost is modeled as an affine function with three parameters. A very nat¬ 
ural question to ask is: How does the optimal secondary structure depend on the 
branching loop parameters? 

Such a question is an example of a parametric analysis, which illuminates the 
dependencies of the solution on the underlying optimization parameters. As we 
illustrate here, methods from geometric combinatorics, specifically polytopes and 
their normal fans, can be used to answer this question, and others including accu¬ 
racy and stability. The outline of this paper is as follows. Section contains the 
basic RNA terminology and a brief overview of the geometric tools needed for para¬ 
metric analysis. In Section we first give background on the NNTM, specifically 
the way the multibranch loops are scored. Then we introduce branching polytopes 
and explain how to compute them using the software pmfe which we have made 
available online. Finally, we present a few biological questions and discuss math¬ 
ematical problems related to the branching polytopes they raise. The paper ends 
with a summary in Section 0. 


2. Background 

2.1. Computational molecular biology. Nucleotide sequences can be modeled 
as strings over a finite alphabet. The questions related to DNA and RNA are 
amenable to analysis using discrete mathematical techniques from algebra, combi¬ 
natorics, and topology. We refer the reader to the following references for mathemat¬ 
ical approaches for studying computational molecular biology 0 [Ml 123 [HIM IS- 
Of particular interest here are techniques from geometric combinatorics, which can 
inform parametric analysis of models arising from optimization questions, such as 
DNA sequence alignment and RNA secondary structure prediction. 

Previous parametric analyses in molecular biology have primarily focused on 
DNA. Given sequences of DNA, an important problem is to find regions that have 
been evolutionary conserved. The first step requires aligning the sequences by 
adding spaces to the sequences until they are the same length. The DNA se¬ 
quence alignment problem is to find the best alignment between sequences given 
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some scoring function [35]. The optimal solution can be determined via dynamic 
programming optimization which scores various features and provides global align¬ 
ment. The optimization fundamentally depends on parameters, and the notion of 
‘best matching’ sequences may depend on the value of these parameters. Techniques 
from geometric combinatorics can be used [35115] to analyze systematically how se¬ 
quence alignment depen ds o n these parameters. The mathematical background will 
be presented in Section [ 2 !^ . 

The problem we focus on here is prediction of the secondary structure of a 
single given RNA sequence. Unlike DNA, which typically forms a fully base-paired 
double helix, RNA typically exists as a single strand whose bases can pair with 
one another to form diverse and complex secondary structures. The secondary 
structure of a given RNA molecule serves as a scaffold for the 3D structure m- 
As shown in Fig. [l|, an RNA secondary structure consists of stacked base pairs 
(helices) and single-stranded regions (loops). The degree of a loop is the number of 
helices emerging from it (or, equivalently, the number of base pairings in the loop). 
Hairpin loops have degree 1, internal loops have degree 2, and multibranch loops 
have degree greater than 2; each RNA molecule also has a (possibly empty) exterior 
loop, which includes the unpaired bases not contained in other loops. Figure[l| shows 
a secondary structure with all of these substructures labeled. 
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Figure 1. Secondary structure of RNA with substructures. 


The base pairings in the secondary structure are difficult to determine experimen¬ 
tally, necessitating mathematical models to predict the structure [40] . Indeed sec¬ 
ondary structure prediction has been an active area of research for several decades 
(see survey in [IS]). One approach for predicting secondary structure is calculat¬ 
ing the free thermodynamic energy of the RNA molecule according to the standard 
NNTM [T7|[33[. In this model, bases that bond are generally energetically favorable 
and stabilize the RNA with a negative free energy, whereas unpaired bases are en¬ 
ergetically disfavorable and destabilize the RNA with a positive free energy. Each 
substructure is assigned an energy value, and the total energy of the secondary 
structure is determined by summing over its substructures [HI [33]. 

If we further assume that secondary structure has no crossing base pairs, we 
could then generate secondary structure of an RNA sequence recursively over its 
subsequences. This allows us to compute the minimum energy structure efficiently 
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using a dynamic programming algorithm. This approach is widely studied, and 
several software packages are available to perform the appropriate computations; 
widely-used implementations include RNAfold from the Vienna RNA package m, 
mfold [IT], and RNAstructure [55]. However, the optimal structure found using the 
thermodynamic model is often not the correct one. The prediction depends on the 
thousands of parameter values in the objective function, affecting the substructure 
prediction. 

Current versions of the scoring model iniiss] have over eight thousand param¬ 
eters, but we focus our attention on three related to multibranch loops such as 
that at the center of Fig. [l|. (We will discuss some details of the model and its 
parameters in Section [Sj].) We perform parametric analysis using geometric com¬ 
binatorics, which has previously been applied to a simplified tree model of RNA 
base pairing that focuses on branching configurations [M] , and extend this to study 
the branching of real RNA sequences under the Turner99 NNTM parameters. In 
the next section we introduce the required combinatorial and geometric notions to 
carry out this parametric analysis. 

2.2. Polytopes and geometric combinatorics. The optimization of the NNTM 
objective function, with an emphasis on branching loop calculations, can be pre¬ 
sented as a simple linear programming (LP) problem, and thus well-established 
mathematical tools can be used to tackle the optimization problem parametrically. 
To analyze optimization problems with large sets of feasible solutions, such as all 
possible secondary structures that could form from a particular sequence of RNA, a 
(relatively) concise description of the possibilities is required. This can be achieved 
by constructing the convex polytope associated with the LP problem at hand. 

The theory of convex polytopes and normal fans is well-developed, so we will 
not reproduce it here. Rather we present just the definitions needed for a rigorous 
description of the results in Section [^. More information on these applications of 
polytopes is given in EllSIlIl]; readers wishing to gain a thorough understanding 
of the theory can see [HI [321 Ej ■ 

The theory of linear programming starts from the following question: Given a 
set of feasible solutions X in and a vector A S M", for which x G A is the 
objective function Ax minimized? 

For a fixed A, this is a straightforward question; however as A varies, different 
vectors in the set X will minimize Ax. If A is a large or infinite set it may be 
simpler to deal with a geometric region containing A, than with the original set 
of feasible solutions itself. A convex polytope is a bounded region in K” which can 
be described by a finite number of linear inequalities a^x < c^. A hyperplane is a 
supporting hyperplane for a polytope P if it intersects P non-trivially but no interior 
point of P is on the hyperplane. A k-face of P is a fc-dimensional intersection of P 
with a supporting hyperplane. A 0-face is a vertex of P, a I-face is an edge, and so 
on, with the n — 1 dimensional faces called facets. The (unique) n-face is defined 
to be the entire polytope. 

The convex hull of a set A is the smallest convex set containing A. If A is a 
finite set, its convex hull P is the set 
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Equivalently, Eq. ([l|) can be used as a definition of a convex polytope—that is, a 
convex polytope is the convex hull of a finite set. It is a fundamental theorem in 
the field that these two definitions of convex polytopes are equivalent. Figure [3 
shows the convex hull of a finite set in R.^. 



Figure 2. The convex hull of the set X\ a polytope with four 
0-faces, four 1-faces, and one 2-face. 

The boundary of a polytope is the union of its facets and contains all faces except 
the n-face. If the polytope P is the convex hull of a set X of feasible solutions, then 
the boundary of P contains all optimal solutions (i.e. all vectors that minimize 
Ax for non-zero A). Thus, studying the polytope rather than the original set X 
simplifies the optimization problem. 

In practice, however, computing the convex hull of the feasible set X is typically 
a hard problem. We usually know too little about the set of feasible solutions 
to construct the convex hull directly. In computational biology, it is usually the 
case that the feasible set is defined implicitly by some intrinsic rules of the linear 
programming problem. The real challenge, then, is to extract information about 
the feasible set from these intrinsic rules. In this chapter, we consider two different 
methods for constructing the convex hull: polytope propaga tion and incremental 
convex hull. We will discuss these methods further in Section 

The dual object to the polytope is its normal fan. The normal fan is a collection 
of regions of R” called cones. If F is a face of the polytope, the cone corresponding 
to F, denoted Cf, is the set of all vectors B such that any point y G F minimizes 
the product Bx —that is, 

Cf = {B e R" : By < Bx for all x e P, y S F}. 

The normal fan is represented visually, as in Fig.[^, by drawing the vectors B G Cp 
with their heads at the face F. 

The normal fan of a polytope of dimension d is a partition of the R'^; every 
d-vector lies in some cone, and the cones are preserved by positive scalar multipli¬ 
cation. The cones from Fig. [^, for example, if drawn so that their heads are the 
origin, clearly fill the entire space R^, as shown in Fig. 0. 

Once we have constructed the convex hull of feasible solutions for a particular 
LP problem, the normal fan provides ways to discuss the stability of a parameter 
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Figure 3. The cones corresponding to the vertices of polytope 
P, drawn “pointing into” their vertices. Also shown: the normal 
vectors to each 1-face of the polytope. 


vector A. If A is near the boundary of a cone of the normal fan, then a small 
change in A will result in a different point of the polytope optimizing the function. 
However if A is far from any boundaries of its cone, then small changes will not 
change which point of the polytope optimizes Ax. 

In Section y we use the NNTM free energy calculation as our objective function 
and the space of all feasible secondary structures over a given RNA sequence as the 
point set. Treating this objective function as a linear functional and applyling the 
machinery of linear programming produces what we term the branching polytope, 
which we use to analyze the parameters in the model related to multibranch loops. 



Figure 4. The normal fan of the polytope fills 


3. RNA BRANCHING POLYTOPE 
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3.1. NNTM. In Sections [ll and [j^ . we introduced the discrete optimization par¬ 
adigm for RNA secondary structure prediction. Before we consider applications 
of geometric combinatorics for parametric analysis of this method, we review it in 
greater detail. 

Discrete optimization methods for RNA secondary structure prediction derive 
their objective function from the NNTM. If we make the standard simplifying as¬ 
sumption to forbid pseudoknots, we can formulate the free-energy minimization 
problem recursively; dynamic programming can then be used to find the minimum 
energy and a traceback algorithm can determine a corresponding structure. 

The specific objective function used for NNTM analysis has evolved substantially 
over the years (cf. [221 EH 02] ), with a significant increase in the number of parame¬ 
ters. The Turner99 version considered here, obtained from the Nearest Neighbor 
Database |33j . has over eight thousand parameters representing the energy contri¬ 
butions of various small substructures—some 300) measured directly through 

experiments, others 7000 inferred indirectly from experimental data, and finally 
a handful through machine learning techniques to tune the model. Among those 
inferred through machine learning are three connected to multibranch loops, such 
as the one at the center of Fig. [l|. 

In particular, the Turner99 version of the model we study assigns to a given 
multibranch loop the energy 

(2) AGinitiation = o + & ' (# Unpaired nucleotides) -b c • (# branching helices) 

where a, b, and c are three of the learned parameters. The free energy changes 
associated with multibranch loops have in fact been studied experimentally 0 
However, these results do not map neatly onto the linear function from Eq. 
which is a computational simplification—a logarithmic dependence on loop size 
would be more biophysically accurate, but the dynamic programming approach 
requires a recursive decomposition that would not be possible with such a function. 
Recent work has led to the development of new approaches to loop energy estimation 
[3H1III, which rank structures more accurately but do not translate neatly into the 
discrete optimization setting. 

Rather than focus on specific values of parameters, we investigate the behavior 
of the model as a function of the parameters, hoping to gain some insight into 
(for example) how sensitive the model is to each one and whether certain families 
of sequences exhibit increased or reduced sensitivity. If we could compute the 
collection of all secondary structures T for a given sequence S, we could attack 
this question directly. Unfortunately, this collection is far too large to compute 
explicitly—the total number of secondary structures on n bases is known to grow 
exponentially with n [2Hj) and specific sequences of biologically-relevant lengths 
may have 10®° or more possible structures—so we need to trim the problem down 
in scope. 

To do this, we mathematically reformulate the NNTM as a simple linear func¬ 
tional focusing on the branching loop energy calculations. For a given RNA se¬ 
quence, let r be a secondary structure with x multibranch loops, y unpaired nu¬ 
cleotides in those multibranch loops, and z helices branching from those multibranch 
loops. Let ogg = 3.4, 5gg = 0, and cgg = 0.4 be the values of the parameters a, 
b, and c in the Turner99 assignment. Then the energy AG associated to the 
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secondary structure T is given by 

(3) AGt = aggx + bggy + CggZ + W 


for a “leftover” energy value w which can be computed using the model. We call 
the vector {x,y,z,w) the (branching) energy signature of the secondary structure 
T. Crucially, this energy signature does not actually depend on the Turner99 
values of a, h, and c; the numbers x, y, and 2 are simply the integer counts of 
certain substructures, while the leftover energy w represents the non-multibranch- 
loop components of the energy calculation including the energy of the helices that 
meet the multibranch loops. Treating all other energy components as a single 
variable w allows us to focus on the multibranch parameters exclusively while also 
linearizing the optimization problem. 

Let a, b, c, and d be four arbitrary rationale. For a given secondary structure T 
on a given RNA sequence, we can compute an associated parameterized energy: 


(4) 


AGt (a, b, c, d) = ax + by -\- cz -\- dw 


for the energy signature {x, y, z, w) of T. 

Given a particular parameter vector (a, b, c, d) and RNA sequence S, we can find 
the secondary structure T which minimizes Eq. fl) us ing existing software; the 
details of this computation are discussed in Section l3.3l . However, to understand 
how the model behaves for all possible values of the parameters a, b, and c, we 
will need to bring the powerful machinery of linear programming and polyhedral 
geometry to bear on the problem. 


3.2. Polyhedral methods and convex hulls. Although the collection of sec¬ 
ondary structures for a given sequence is enormous, it is nevertheless finite, and its 
corresponding energy signatures occupy a bounded region of R.'^. We thus shift our 
attention to this region—specifically, to the convex hull of the collection of signa¬ 
ture vectors for the sequence which we refer to as the branching polytope. Unlike a 
classical LP problem, in which the feasible region is explicitly defined by inequal¬ 
ities, the collection of all branching signatures is implicitly defined by the NNTM 
and the RNA sequence. As a result, we have little information about the branching 
polytope. On the other hand, we have a dynamic programming algorithm that can 
find an energy-minimizing RNA configuration for any particular set of parameters. 
Hence it is natural to try to construct the convex hull of the signature vectors using 
this dynamic programming algorithm. 

The first approach we consider is polytope propagation m- Polytope prop¬ 
agation has been used in parametric analysis of various problems, such as DNA 
sequence alignment and hidden Markov model for gene annotation [53]. This al¬ 
gorithm computes the convex hull by recursive convex hull and Minkowski sum 
computations on unions of polytopes. The recursions are related to the dynamic 
programming algorithm for calculating MFE structures. Namely, the dynamic pro¬ 
gramming algorithms we use involve decomposing a problem into a sum of smaller 
problems, each of which is a minimization problem involving only addition. The 
minimization operations can be translated into taking convex hulls of unions of 
polytopes, while the sums can be translated into polytope Minkowski sums, trans¬ 
porting the algorithm directly into the geometric domain. 

However, while the translation from dynamic programming to polytope algebra is 
mathematically very elegant, polytope propagation might not be the most efficient 



BRANCHING POLYTOPES FOR RNA SEQUENCES 


9 


way to compute the RNA branching polytope. The intermediate polytope compu¬ 
tations required in this approach (convex hulls and Minkowski sum) are expensive 
for complicated polytopes, and these costs quickly add up. In fact, empirical results 
show that polytope propagation is often outperformed by incremental convex hull 
approaches, especially for high-dimensional models [ 3 . In addition, applying the 
polytope propagation approach would require re-implementing the optimization al¬ 
gorithm from scratch without taking advantage of the existing fast, peer-reviewed 
software for NNTM prediction. 

As a result, we instead use a variant of the Beneath-and-Beyond approach. The 
core idea of Beneath-and-Beyond was introduced by Griinbaum in m as a method 
to find the facets of the convex hull of a given point set. Huggins m developed 
his iB4e algorithm, which applies this idea to specialized LP problems such as 
ours. The basic idea of iB4e method is to build the convex hull incrementally— 
that is, to add one vertex at a time to an already constructed convex hull, by 
systematically solving LP problems to either expand the hull or confirm that some 
part of it matches the final hull. The objective vectors are chosen so that after each 
iteration, either a new vertex of the convex hull is found or a facet of the convex 
hull is confirmed. This ensures that the method requires running the LP solver for 
no more than 0{V -\- F) objective vectors, if the convex hull has V vertices and F 
facets. The algorithm iB4e applies to abstract convex-hull-finding problems in any 
dimension, so we describe it here in full generality. 

Before the main loop of the algorithm can be applied, we must perform an 
initialization step which finds a collection of points whose affine hull is the same as 
the affine hull of the branching polytope we aim to construct. (Several approaches 
to this step are possible; for example in M" one might run the LP solver with the 
2n basis vectors (±1,0,..., 0), (0, ±1,..., 0),..., (0,..., 0, ±1).) We then compute 
the convex hull of these points and label all facets of the convex hull as ‘tentative,’ 
as illustrated in Fig. 

We then begin the main loop of the algorithm. At each step, we pick a tentative 
facet, use its outer normal vector as objective function, and apply the LP solver with 
this objective. If the signature of the optimal RNA structure for these parameters is 
not outside the known hull, as illustrated in Fig. ISbI. that facet becomes ‘confirmed’ 
and the process is restarted with another tentative facet. However, if the resulting 
signature is outside the known hull, as illustrated in Fig.js^, we add it to the vertex 
set, compute the new convex hull, and label the newly added facets as tentative. 

The process is repeated until all facets are confirmed. In the case of a particular 
RNA sequence, the set of all possible secondary structures is finite and thus set of 
signatures is also finite. The process must terminate and its end result is the convex 
hull of the set of all solutions to the LP solver. This is the branching polytope, 
and once we have computed it we can easily compute its normal fan using standard 
methods and proceed to study the model. 


3.3. Computing the branching polytope with pmfe. It is conceptually natu¬ 
ral to apply the Beneath-and-Beyond algorithm to compute the branching polytope 
of a particular RNA sequence by treating the parametrized NNTM minimization 
problem as a linear program. However, coupling the two algorithms to perform 
the computation requires some care. We introduce the software pmfe, available 
publicly at https://github.com/AMS-MRC-disc-math-bio/pmfe, Instructions to 
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(a) Initialization (b) Confirming a tentative facet 



(c) Extending a tentative facet 

Figure 5. The Beneath-and-Beyond algorithm for polytope construction 


build and install the software are provided in the hie README.md of that reposi¬ 
tory. Alternatively, a ready-to-use version of the software is available in a Docker 
container, posted publicly on the Docker hub as agdphd^mf e. 

The pmfe package has three components (see Fig. [hi): iB4e, findmfe, and 
scorer. 

• iB4e is an implementation of Huggins’ algorithm as a header-only template 
library in C-I--I-, taking advantage of the CGAL computational geometry 
library |30j to handle the geometric computations. 

• findmfe takes a parameter vector P = (a, 5, c, d) and an RNA sequence 
S as input and returns as output a secondary structure on S which has 
minimal energy with respect to P. From the available software packages 
which implement NNTM optimization, we chose to base findmfe on GTf old 
m because it is parallelized and thus able to take advantage of modern 
multi-core desktop computers. Some modihcations were required to adapt 
it to our use. For computational efficiency, GTf old performs all arithmetic 
with hxed precision of two decimal places; we converted it to use arbitrary- 
precision rational arithmetic based on the GMP library m, which increases 
running time by an order of magnitude but ensures that the results are 
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exact even when parameters are not multiples of 1/100. We also added 
an interface to modify the multibranch parameters (a, b, c) and the dummy 
scaling parameter d programmatically, allowing the Beneath-and-Beyond 
algorithm to treat GTfold as a function of these four variables and call it 
repeatedly. 

• scorer takes an RNA sequence S and a secondary structure T and returns 
the signature vector {x,y,z,w) of T according to Eq. (Q). (In particular, 
it computes the value of w by computing the energy decomposition of T 
under the Turner99 parameters.) 

We combine these three components with the control flow illustrated in Fig. 0 
to create the pmfe software package. Given an RNA sequence S', the iB4e module 
repeatedly generates parameter vectors according to Huggins’ algorithm. Each of 
these vectors is sent to the f indmfe module as a query, which generates a secondary 
structure with minimal energy. Each such secondary structure is sent to the scorer 
module to find its signature vector, which is then sent back to iB4e as the response 
to the query. After some number of iterations of this loop, the convex hull of these 
signature points s is found to be equal to the branching polytope, which is returned 
to the user. (For convenience, we in fact provide the user with a file containing 
both the signatures s and their associated structures T in condensed dot-bracket 
notation.) 



Figure 6. Control flow of the pmfe software 

Since the software pmf e which we introduce is complex and extensively modifies 
the peer-reviewed codebase of GTfold, we provide a testing framework to ensure 
correctness. Tests are available to ensure that pmfe returns the same structures and 
scores as GTfold, using both the classical Turner99 parameters [m and various 
modified parameter sets for a variety of natural and synthetic RNA sequences. 
Details for running these tests are available in the pmfe repository. 

We also provide a module rna_poly in the SageMath computer algebra system 
[2U] which can be used to study the branching polytopes produced by pmfe. This 
can be used to generate a variety of visualizations of the normal fan of the branching 
polytope. In particular, it provides a simple interface for taking the d = 1 slice of the 
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fan, giving a polyhedral partition of whose regions are collections of parameters 
(a, 5, c) which yield the same secondary structure for the sequence under study. The 
results are illustrated in Fig. 0, which shows the 6 = 0, d = 1 slice of the normal 
fan of the branching polytope for a H.sapiens tRNA sequence. The Turner99 
parameter values ogg = 3.4, Cgg = 0.4 are marked with a circle just northeast of the 
origin. 

H. sapiens tRNA 



multibranch loop initiation penalty (a) 

Figure 7. The intersection of the parameter space decomposition 
for a H.sapiens tRNA sequence with 6 = 0. 


3.4. Biological questions leading to mathematical problems. The RNA branch¬ 
ing polytope and the associated polyhedral decomposition of the three-dimensional 
multibranch parameter space obtained by intersecting its normal fan with d = 1 
encapsulate the dependency of the optimization on the parameters a, 6, c. Various 
questions related to accuracy and stability can be addressed through the analy¬ 
sis of the polytope and the polyhedral decomposition. However, answering such 
questions in a manner that is biologically relevant requires consideration of certain 
subtleties and this leads to interesting mathematical problems. Here we discuss a 
few biological questions that could be addressed through the parametric analysis 
and related mathematical problems that arise. 

Question 1: How robust is the optimization for a given RNA sequence? 

In the strictest sense, the prediction is sensitive if a variation of the parameters 
within an allowed margin of error produces different optimal secondary structures. 
This occurs if the distance of the Turner99 parameters (agg, 6gg, cgg) from the 
boundary of the polyhedron that contains this point is smaller than the allowed 
margin of error. For example, Fig.0 shows the position of (agg, 6gg, Cgg) within the 
polyhedral decomposition of a H.sapiens tRNA (only the 6 = 0 slice is depicted, 
which is the baseline value in Turner99). Its distance to the boundary is 0.13699 
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and, for instance, optimization using a = 3.39130, b = —0.14786, c = 0.37391 yields 
structures with a different signature. 

However, this by itself does not mean that the optimization is very sensitive; 
analysis of the structure space is required so that one can quantify the sensitivity. 
Namely, two structures with different signatures may still have a lot of structural 
similarities (for example, very similar long helices and branching pattern), in which 
case one would not necessarily say that the prediction is very sensitive. Moreover, 
since the structure-to-signature map is not one-to-one, when quantifying the struc¬ 
tural differences in the optimal structures one actually needs to compare two sets of 
structures, each one corresponding to a given signature. In short, in the assessment 
of robustness we face the following problem. 

Problem 1: Define a good representative of the set of structures that correspond 
to the same signature. 

The problem of representing a whole set of structures by a single one has ap¬ 
peared before, for example in the context of compact representation of a sample 
of structures. However, the methods developed for this (e.g. sfold clustering [7] 
and profiling m) usually assume structural similarities of the elements in the set. 
Here, though, we are presented with a different set-up: the structures mapping 
to the same signature do not a priori have any common motifs. Therefore, choos¬ 
ing one consensus structure as the other methods do may not yield a reasonable 
representative of the whole set. 

Question 2: How much can the prediction of secondary structure be improved 
for a given sequence? 

For sequences for which the native structure has been obtained experimentally 
or via comparative sequence analysis, one could assess the accuracy of the NNTM 
by comparing it to the MFE structure. In fact, since the structures corresponding 
to all signatures on the boundary of the branching polytope can in practice be 
obtained fairly efficiently one can precisely determine branching parameters 
that would yield a signature corresponding to a structure that is closest to the 
native one. However, here we again face the problem that besides the most accu¬ 
rate one, that signature corresponds to a whole set of MFE structures that might 
be very different from the native structure. In such a case, the almost accurate 
structure might be unrecognizable. Therefore, in assessing accuracy, the problem 
of finding a good representative structure for a given signature is still very relevant. 
Furthermore, even if we can solve this problem, the most accurate prediction ob¬ 
tained this way may require a triple of parameters (a, 5, c) that is very different from 
(ugg, 6gg, Cgg), which may not be very desirable. Namely, even though (ogg, 6gg, Cgg) 
were not obtained experimentally, they still might have some biological relevance 
since the initiation point for the genetic algorithm used was suggested by stabilities 
determined by optical melting for an RNA multibranch loop with three branching 
helices m- This suggests the following natural problem. 

Problem 2: Maximize the improvement in accuracy while minimizing parameter 
change. 

One mathematical subproblem here is to formulate an objective function which 
appropriately measures the difference between two RNA secondary structures. While 
there are simple structure metrics that can be very efficiently computed (e.g. the 
base pair metric based on symmetric set difference), other metrics may be more 
appropriate to quantify the structural differences |21j . 
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Another important subproblem is formulating an appropriate metric to measure 
changes in parameter values. Standard metrics such as the p-metrics can be ap¬ 
plied, since the parameter vectors are taken from but it is not clear which is 
most appropriate. Furthermore, careful consideration should be given to how each 
component of the parameter vector is weighted: a unit change in each coordinate 
may have significantly different impacts on the secondary structure. For example, 
since a given multibranch loop in a structure has several unpaired nucleotides, a 
small change (e.g. by 0.1) in b in general has bigger effects on the prediction than 
the same change in a. 

Question 3: Is there a choice of branching parameters that would improve the 
prediction accuracy for a family of sequences? 

Homologous sequences perform the same function in different species and they 
fold into essentially the same structure. However, their MFE prediction accuracies 
can vary significantly. For instance, within the tRNA family, the accuracy calcu¬ 
lated as the F-measure (the harmonic mean of the MFE sensitivity and positive 
predictive value against true positive base pairs) ranges from the minimal 0 to 
the maximal 1 m- Hence, it is natural to try to find a set of parameters which 
yields the highest accuracy for a family of sequences. For this, one needs to address 
the question: Are there any patterns in the dependency of the prediction on the 
parameters across a family of homologous sequences? Mathematically, one could 
phrase this as a problem of finding similarities between polytopes in R^ or between 
polyhedral decompositions of R^. For example. Fig. [s] shows the 5 = 0 slice of 
the polyhedral decompositions for a tRNA sequence of L.delbrueckii. By visually 
comparing this figure to Fig. 0 one can notice similarities in the directions of the 
splitting hyperplanes. The challenge of course is to quantify such similarities. So, 
we face the following problem. 

Problem 3: Find a way to compare branching polytopes. 

Comparison of polytopes is a problem that has appeared before, for instance in 
relation to the traveling salesman problem in optimization [9]. Even in optimiza¬ 
tion problems when one works with relaxation polytopes that are, loosely speak¬ 
ing, “close,” comparing relaxations is not well understood and different comparison 
methods have been suggested (e.g. m)- In our case, an additional challenge is the 
fact that the polytopes to be compared do not a priori satisfy any mathematically 
tractable assumptions for closeness. 

4. Conclusions 

The affine energy function which governs the branching of an RNA secondary 
structure under the NNTM optimization is a known weakness of the thermodynamic 
model, and the methods from geometric combinatorics outlined here offer signifi¬ 
cant potential to assess its accuracy and stability through a parametric analysis. 
While simplified models are certainly more tractable, the new pmfe computational 
framework makes possible the analysis of branching polytopes for real RNA se¬ 
quences for the first time. As illustrated by the qualitative similarities between 
Fig. 0 and Fig.0, this approach is capturing some interesting characteristics which 
are preserved between two different sequences. Moreover, as discussed, moving be¬ 
yond the qualitative comparison of polytope slices to quantify their similarities and 
differences presents some interesting mathematical challenges. Consequently, it re¬ 
mains to be seen whether the observed patterns in these proof-of-principle results 
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Figure 8. The 6 = 0 slice of the parameter space decomposition 
for a L.delbrueckii tRNA sequence. 


are resulting from the structure of the thermodynamic optimization and/or the cod¬ 
ing of structural motifs in the base pairing of these RNA sequences. Resolving this 
dichotomy will reveal much about the interplay between mathematics and biology 
in the prediction of RNA secondary structures. 
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